Internal Regret with Partial Monitoring 
Calibration-Based Optimal Algorithms 

Vianney Perchet* 
January 12, 2013 

Abstract 

We provide consistent random algorithms for sequential decision 
under partial monitoring, i.e. when the decision maker does not observe 
the outcomes but receives instead random feedback signals. Those 
algorithms have no internal regret in the sense that, on the set of 
stages where the decision maker chose his action according to a given 
law, the average payoff could not have been improved in average by 
using any other fixed law. 

They are based on a generalization of calibration, no longer defined 
in terms of a Voronoi' diagram but instead of a Laguerre diagram (a 
more general concept). This allows us to bound, for the first time in 
this general framework, the expected average internal - as well as the 
usual external - regret at stage n by 0{n^^^^), which is known to be 
optimal. 

Key Words : Repeated games. On-line learning. Regret, Partial 
Monitoring, Calibration, Voronoi and Laguerre Diagrams 

Hannan [l7j introduced the notion of regret in repeated games: a player 
(that will be referred as a decision maker or also a forecaster) has no external 
regret if, asymptotically, his average payoff could not have been greater if 
he had known, before the beginning of the game, the empirical distribution 
of moves of the other player. Blackwell [6] showed that the existence of 
such externally consistent strategies, first proved by [l7j, is a consequence 
of his approachability theorem. A generalization of this result and a more 
precise notion of regret are due to Foster & Vohra [13] and Fudenberg &: 
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Levine |16j : there exist internally consistent strategies, i.e. such that for any 
of his action, the decision maker has no external regret on the set of stages 
where he actually chose this specific action. Hart & Mas-Colell [18] also 
used Blackwell's approachability theorem to construct explicit algorithms 
that bound the internal (and therefore the external) regret at stage n by 



Some of those results have been extended to the partial monitoring 
framework, i.e. where the decision maker receives at each stage a random 
signal, whose law might depend on his unobserved payoff. Rustichini [27] 
defined - and proved the existence of - externally consistent strategies, i.e. 
such that the average payoff of the decision maker could not have been 
asymptotically greater if he had known, before the beginning of the game, 
the empirical distribution of signals. Actually, the relevant information is a 
vector of probability distributions, one for each action of the decision maker, 
that is called a Bag. 

Some algorithms bounding optimally the expected regret by O (n~^/'^) 
have been exhibited under some strong assumptions on the signalling struc- 
ture - see Cesa-Bianchi & Lugosi [9], Theorem 6.7 for the optimality of this 
bound. For example, Jaksch, Ortner & Auer [20] considered the Markov de- 
cision process framework, Cesa-Bianchi, Lugosi &: Stoltz [10] assumed that 
payoffs can be deduced from flags and Lugosi, Mannor & Stoltz [23] that 
feedbacks are deterministic (along with the fact that the worst compati- 
ble payoff is linear with respect to the flag). When no such assumption is 
made, Lugosi, Mannor & Stoltz |23j provided an algorithm (based on the 



In this framework, internal regret was defined by Lehrer &: Solan |21j : 
stages are no longer distinguished as a function of the action chosen by the 
decision maker (as in the full monitoring case) but as a function of its law. 
Indeed, the evaluation of the payoff (usually called worst case) is not linear 
with respect to the flag. So a best response - in a sense to be defined - to a 
given flag might consist only in a mixed action (i.e. a probability distribu- 
tion over the set of actions). Lehrer & Solan |21j also proved the existence 
and constructed internally consistent strategies, using the characterization 
of approachable convex sets due to Blackwell [5]. Perchet [24] provided an 
alternative algorithm, recalled in section [T2| this latter is based on calibra- 
tion, a notion introduced by Dawid |12j . Roughly speaking, these algorithms 
e-discretize arbitrarily the space of flags and each point of the discretization 
is called a possible prediction. Then, stage after stage, they predict what 
will be the next flag and output a best response to it. If the sequence of 



0(n-i/2). 
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predictions is calibrated then the average flag, on the set of stages where a 
specific prediction is made, will be close to this prediction. 

Thanks to the continuity of payoff and signaling functions, both algo- 
rithms bound the internal regret hy e + O (n~^/^). However the first draw- 
back lies in their computational complexities: at each stage, the algorithm 
of Perchet [Mj solves a system of linear equations while the one Lehrer &: 
Solan [21j . after a projection on a convex set, solves a linear program. In 
both case, the size of the linear system or program considered is polynomial 
in e and exponential in the numbers of actions and signals. The second 
drawback is that the constants in the rate of convergence depend drastically 
on e. 

As a consequence, a classic doubling trick argument will generate an 
algorithm with a strongly sub-optimal rate of convergence - that might even 
depend on the size of the actions sets - and a complexity that increases with 
time. 

Our main result is Theorem 12.101 stated in section 12.31 it provides the 
first algorithm that bounds optimally both internal and external regret by 
O (n~^/'^) in the general case. It is a modification of the algorithm of Perchet 
|24] that does not use an arbitrary discretization but constructs carefully a 
specific one and then computes, stage by stage, the solution of a system 
of linear equations of constant size. In section 13.11 an other algorithm - 
based on Blackwell's approachability as the one of Lehrer & Solan [21] - 
with optimal rate and smaller constants is exhibited; it requires however to 
solve, at each stage, a linear program of constant size. 



Section 1 is devoted to the simpler framework of full monitoring. We 
recall definitions of calibration and regret and we provide a naive algorithm 
to construct strategies with internal regret asymptotically smaller than e. 
We show how to modify this algorithm - however in a not efficient way - 
in order to bound optimally the regret by O (n"^/'^). This has to be seen 
only as a tool that can be easily adapted with partial monitoring in order 
to reach the optimal bound of O (n~^/^); this is done in section 2. Some 
extensions (the second algorithm, the so-called compact case and variants to 
strengthen the constants) are presented in section 3. Some technical proofs 
can be found in Appendix. 
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1 Full monitoring 



1.1 Model and definitions 

Consider a two-person game T repeated in discrete time, where at stage 
n G N, a decision maker, or forecaster, (resp. the environment or Nature) 
chooses an action in £ I (resp. j„ G J^)- This generates a payoff pn = 
p{in,jn), where p is a mapping from X x ^7 to M, and a regret r„ G 
defined by: 



PihJn) - P{in,jn) 



G 



where / is the finite cardinahty of I (and J the one of J'). This vector 
represents the differences between what the decision maker could have got 
and what he actuahy got. 

The choices of i„ and j„ depend on the past observations (also called 
finite history) hn-i = {ii, ji, . . . ,in-i, jn-i) and may be random. Explic- 
itly, the set of finite histories is denoted hy H = IJneN (-^ ^ ^)"; with 
(I X J)^ = and a strategy a of the decision maker is a mapping from 
H to A(X), the set of probability distributions over I. Given the history 
hn £ {I X i7)", (y{hn) £ ■A (2^) is the law of in+i- A strategy r of Nature is 
defined similarly as a function from H to A(^7). A pair of strategies (o", r) 
generates a probability, denoted by P(t,t) over {71, A) where H = (I x J')^ 
is the set of infinite histories embedded with the cylinder cr-field. 

We extend the payoff mapping p to A(X) xA{J') by p{x, y) = ^x,y[p{i-, j)] 
and for any sequence a = (a„J,^gpj and any n G N*, we denote by a„ = 
n Ylm=i average of a up to stage 7i. 

Definition 1.1 (Hannan |17| ) A strategy a of the forecaster is externally 
consistent if for every strategy r of Nature: 

limsupf^ < 0, G X, Po-,t— as. 

In words, a strategy a is externally consistent if the forecaster could not have 
had a greater payoff if he had known, before the beginning of the game, the 
empirical distribution of actions of Nature. Indeed, the external consistency 
of a is equivalent to the fact that : 

limsup max p{x,jn) - Pn<0, Pcr,T-as. (1) 

Foster & Vohra [13] (see also Pudenberg & Levine |16j ) defined a more 
precise notion of regret. The internal regret of the stage n, denoted by 
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Rn G M^^^, is also generated by the choices of in and jn and its (i,A;)-th 
coordinate is defined by: 



j^ik ^ f P{k,jn) - P{hjn) if i = «n 

' otherwise. 



Stated differently, every row of the matrix Rn is null except the i^-th which 
is rn. 

Definition 1.2 (Foster & Vohra |13| ) A strategy a of the forecaster is 
internally consistent if for every strategy r of Nature: 

limsup^jf < Vi, A; G X, Pa,r-as. 

n— >oo 

We introduce the following notations to define e-internally consistency. 
Denote by Nn{i) the set of stages before the n-th where the forecaster chose 
action i and Jnii) G '^(^) the empirical distribution of Nature's actions on 
this set. Formally, 

Nn{i:} = {mG{l,...,n}; im = i} and Ui) = ^":l!'"^^.^'" € A{J). 

(2) 

A strategy is e-internally consistent if for every i,k £ I 



-as. 



limsnp^-^^^^( p{k,jn{i)) - p{ijn{i)) - e) < 0, P,,,- 

n—^oo n y J 

If we define, for every e > 0, the e-best response correspondence by : 

^-ReCy) = i a; G A(X); p{x,y)> max p(z, y) - e I , 

then a strategy of the decision maker is e-internally consistent if any action i 
is either an e-best response to the empirical distribution of Nature's actions 
on Nn{i) or the frequency of i is very small. We will simply denote BRq by 
BR and call it the best response correspondence. 

From now on, given two sequences G £, G K'^; m G N} where £, 
is a finite set, we will define the subset of integers Nn{l) and the average 
an{l) as in equation ([2]). 

Proposition 1.3 (Foster & Vohra [13] ) For every e > 0, there exist e- 
internally consistent strategies. 
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Although the notion of internal regret is a refinement of the notion of 
external regret (in the sense that any internally consistent strategy is also 
externally consistent), Blum & Mansour proved that any externally con- 
sistent algorithm can be efficiently transformed into an internally consistent 
one (actually they obtained an even stronger property called swap consis- 
tency). 

Foster & Vohra [13] and Hart & Mas-Colell [18] proved directly the ex- 
istence of 0-internally consistent strategies using different algorithms (with 
optimal rates and based respectively on the Expected Brier Score and Black- 
well's approachability theorem). In some sense, we merge these two last 
proofs in order to provide a new one — given in the following section — 
that can be extended quite easily to the partial monitoring framework. 

1.2 A nai've algorithm, based on calibration 

The algorithm (a similar idea was used by Foster & Vohra [13]) that con- 
structs an e-internally consistent strategy is based on this simple fact: if the 
forecaster can, stage by stage, foresee the law of Nature's next action, say 
y £ A(^), then he just has to choose any best response to y at the following 
stage. The continuity of p implies that the forecasts need not be extremely 
precise but only up to some 5 > 0. 

Let {?/(/); I £ £,} he a 5-grid of A(^) (i.e. a finite set such that for every 
y S A(J') there exists I £ C such that \\y — y{l)\\ < 5) and be a best 
response to y{l), for every I G C Then if 6 is small enough: 

\\y-y{l)\\<25^i{l)€BR2e{y) 

It is possible to construct a good sequence of forecasts by computing a 
calibrated strategy (introduced by Dawid [12] and recalled in the following 
subsection 11.2. ip . 

1.2.1 Calibration 

Consider a two-person repeated game Fc where, at stage n. Nature chooses 
the state of the world jn in a finite set J and a decision maker (that will 
be referred in this setting as a predictor) predicts it by choosing y{ln) in 
y = {y{l); I G C}, a finite (5-grid of A(J') - its cardinality is denoted by L. 
As usual, a behavioral strategy a of the predictor (resp. r of Nature) is a 
mapping from the set of finite histories H = IJnGN ^ <^)" A(£) (resp. 
A{J')). We also denote by P(j,t the probability generated by the pair (o", r) 
over {T-L,A) the set of infinite histories embedded with the cylinder topology. 
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Definition 1.4 (Dawid |12| ) A strategy a of the predictor is calibrated 
(with respect to y = {y{l); I & C}) if for every strategy r of Nature, ¥a,T-0'S: 



\Nnil)\ 



limsup^:^! ||j-„(/)-y(Of -||j„(0-2/(fc)f ) <o, \fk,lec, 

n— >oo ri 



where \\ ■ \\ is the Euclidian norm of . 



In words, a strategy is calibrated if for every I £ C, the empirical distribution 
of states, on the set of stages where y{l) was predicted, is closer to y{l) than 
to any other y{k) ( or the frequency of I, \Nn{l)\/n, is small). 

Given a finite grid of A{J'), the existence of calibrated strategies has 
been proved by Foster & Vohra [H] using either the Expected Brier Score 
or a minmax theorem (actually this second argument is acknowledged to 
Hart). We give here a construction, related but simpler than the one of 
Foster and Vohra, due to Sorin 1301. 



Proposition 1.5 (Foster & Vohra |14| ) For any finite grid y of /S.{J), 
there exist calibrated strategies with respect to y such that for every strategy 
T of Nature: 

1 



LkeC n 



< O 



Proof. Consider the auxiliary game where, at stage n E N, the predictor 
(resp. Nature) chooses In £ C (resp. j„, € J) and the vector payoff is the 
matrix Un G M^^-^ where 



jjlk 



y{i)\?-\\jn-ym? in = Zn 

otherwise. 



A strategy a is calibrated with respect to C if C/„ converges to the negative 
orthant. Indeed for every l,k £ C, the (Z, k)-th. coordinate of Un is 

|A^„(OI E™6jv„(o \\jm - y(Of - \\jm - y(^^"2 



JJlk 

^ 1-1 



n \Nnil)\ 
\Nn{l)\ 



n 



iij„(/)-y(or-iijn(o-y(^)ii^ 



Denote by := {max (O, Un)}i =: Un — Un the positive part of 
Un and by A„, € A(£) any invariant measure of Un - We recall that A is an 
invariant measure of a nonnegative matrix U if, for every I £ C, 



yx{k)u'' = Hi)yu' 
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Its existence is a consequence of Perron- Frobenius Theorem, see e.g. Seneta 

EHl. 



Define the strategy a of the predictor inductively as fohows. Choose 
arbitrarily o"(0), the law of the first action and at stage n + 1, play accord- 
ingly to any invariant measure of . We claim that this strategy is an 
approachability strategy of the negative orthant of M^^^ because it satisfies 
Blackwell [5]'s sufficient condition: 

Vn G N, (C7„ - U-,Ex„ [t/„+i|j„+i] - U') < 0. 

Indeed, for every possible jn+i G J^'. 



(3) 



where the second equality follows from the definition of positive and negative 
parts. 

Consider the first equality. The (/, A;)-th coordinate of E;s^^[?7„+i|j,i+i] is 
(llin+i - - llin+i " therefore the coefficient of \\jn+i - 



y{l)f in the first term is A„(0Efc6£ (^^n " T^kecU^) (f/+)". This 
equals since is an invariant measure of U^. 



Blackwell [5]'s result also implies that E^j^r 
strategy r of Nature where = sup^<„ Eo-,t 



\U+\\] < 2M„n-i/2 for 



\U„ 



4L. 



any 
□ 



Interestingly, the strategy a we constructed in this proof is actually 
internally consistent in the game with action spaces C and J and payoffs 
defined by p{l,j) = - - y{ 



Corollary 1.6 For any finite grid y of A{J'), there exists cr, a calibrated 
strategy with respect to y, such that for every strategy r of Nature, with ¥„^t- 
probability at least 1 — 6: 



max ( wui) - vim' - wui) - vim' )<^+en, 



n 
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where B„ 



sup \j^a,' 
m<n 



sup sup Eo-,r 



Ik 



sup sup 

m<n l,k(^C 



U: 



Ik 



< 3. 



< 3; 



Proof. Proposition 11.51 implies that Eo-,t \pn\ ^ 2M„n~^/^. Hoeffding- 
Azuma's inequahty (see Lemma 13.41 below in section 13.3. ip implies that 
with probability at least 1 — 5 : 



Tjlk _ Tg 



Ik 



< 



'21n 



Freedman's inequality (an analogue of Bernstein's inequality for martingale 
see Freedman [15], Proposition 2.1 or Cesa-Bianchi & Lugosi [9], Lemma 
A. 8) implies that with probability at least \ — 5 : 



Tjlk _ Tg 



u: 



Ik 



< 



'2 In 



n 



1\ 2Kn 1 



The result is a consequence of these two inequalities and of Proposition II. 5[ 

□ 

The definition of @n as a minimum (and the use of Freedman's inequal- 
ity) will be useful when we will refer to this corollary in the subsequent 

sections. Obviously, in the current framework, 0„, < -^4/2 In (^"x)' 



1.2.2 Back to the Naive Algorithm 

Let us now go back to the construction of e-consistent strategies in P. Com- 
pute (T, a calibrated strategy with respect to a (5-grid y = {y{l); I G C} 
of A(J') in an abstract calibration game F^. Whenever the decision maker 
(seen as a predictor) should choose the action / in Fc, then he (seen as a 
forecaster) chooses G BR{y[l)) in the original game F. We claim that 
this defines a strategy which is 2e-internally consistent. 
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Proposition 1.7 (Foster & Vohra |13| ) For every e > 0, the strategy a ^ 
described above is 2e-internally consistent. 

Proof. By definition of a calibrated strategy, for every ?? > 0, there exists 
with probabihty 1, an integer G N such that for every /,A; G £ and for 
every n> N : 



\Nn{l)\ 



n 



\Ui)-y{i)\?-\\Ui)-y{k)f <r? 



Since {y{k); A; G £} is a (5-grid of A(J'), for every I G C and every ra G N, 
there exists k ^ C such that ||jn(0 "2/(^)11^ ^ ^'^^ hence ||jn(0 ~ 2/(011^ ^ 
6'^ +VT7rm- Therefore, since G BR{y{l)): 



|A^n(0|- 

> 1 ^ ||I„,(0 - y{l)f < 25^ ^ Pikjnil)) - pm.Ul)) < 2e, 



n 6"^ 

for every k £ I, I £ C and n > N. The (i, A;)-th coordinate of Rn satisfies: 
^^fe-2e) < I H {pik,jm)-pii,jm)-2e 



ll (^p{k,jm) - p{i,jm) -2e^ 

l:i{l)=i meNn{l) 

E ^-^^(^p{kjn{i))-pm,ui))-2sy 



l:i{l)=i 

Recall that either l^^^M > and p{k,jn{i)) - p{i{l)jn{l)) - 2e < 0, or 
^-^^ < ^. Since p is bounded (by Mp > 0), then : 

^^(K'=-2e) <7?^^, V.GX,VA;GX,Vn>iV, 

which implies that a is 2e-internally consistent. □ 

Remark 1.8 This naive algorithm only achieves e-consistency and Propo- 
sition I j.5l implies that 



niax ( - £ 



The constants depend drastically on L, which is in the current framework 
in the order of e"^ , therefore it is not possible to obtain 0-internally con- 
sistency at the same rate with a classic doubling trick argument (i.e. use a 
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-internally consistent strategy on stages, then switch to a 2"^'^"^"'^^- 
internally consistent strategy, and so on, see e.g. Sorin 129], Proposition 3.2 
page 56). 

Moreover, since this algorithm is based on calibration, it computes at 
each stage an invariant measure of a non-negative matrix; this can be done, 
using Gaussian elimination, with O (L^) operations, thus this algorithm is 
far from being efficient (since its computational complexity is polynomial in 
e and exponential in J). There exist 0-internally consistent algorithms, see 
e.g. the reduction of Blum & Mansour that do not have this exponential 
dependency in the complexity or in the constants. 

On the bright side, this algorithm can be modified to obtain 0-consistency 
at optimal rate; obviously, it will still not be efficient with full monitoring 
(see section However, it has to be understood as a tool that can be 

easily adapted in order to exhibit, in the partial monitoring case, an optimal 
internal consistent algorithm (see section [273\) . And in that last framework, 
it is not clear that we can remove the dependency on L ( especially for the 
internal regret). 

1.3 Calibration and Laguerre diagram 

Given a finite subset of Voronoi' sites {z{l) G W^; / G £}, the l-th Voronoi' cell 
V{1), or the cell associated to z{l), is the set of points closer to z{l) than to 
any other z{k): 

V{1) = |z G M"^; \\Z - z{l)f < \\Z - z{k)f , VA; G £} , 

where || • || is the Euclidian norm of W^. Each V{1) is a polyhedron (as 
the intersection of a finite number of half-spaces) and {V{1); / G £} is a 
covering of M"'. A calibrated strategy with respect to {z{l); I G £} has 
the property that for every / G C, the frequency of / goes to zero, or the 
empirical distribution of states on Nn{l), converges to V{1). 

The naive algorithm uses the Voronoi diagram associated to an arbitrary 
grid of A(j7) and assigns to every small cell an e-best reply to every point 
of it; this is possible by continuity of p. A calibrated strategy ensures that 
jn{l) converges to V{1) (or the frequency of / is small), thus choosing i{l) 
on Nn{l) was indeed a e-best response to Jn(0- With this approach, we 
cannot construct immediately 0-internally consistent strategy. Indeed, this 
would require that for every I G C there exists a 0-best response i{l) to every 
element y in V{1). However, there is no reason for them to share a common 
best response because {z{l); Z G £} is chosen arbitrarily. 
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On the other hand, consider the simple game called Matching Penny. 
Both players have two action Heads and Tails, so A{J') = A(X) = [0,1], 
seen as the probability of choosing T. The payoff is 1 if both players choose 
the same action and -1 otherwise. Action H (resp. T) is a best response for 
Player 1 to any y in [0, 1/2] (resp. in [1/2, 1]). These two segments are exactly 
the cells of the Voronoi diagram associated to {y(l) = 1/4, y(2) = 3/4}, 
therefore, performing a calibrated strategy with respect to {y{l),y{2)} and 
playing H (resp. T) on the stages of type 1 (resp. 2) induces a 0-internally 
consistent strategy of Player 1. 

This idea can be generalized to any game. Indeed, by Lemma 11.101 
stated below, A(J') can be decomposed into polytopial best-response areas 
(a polytope is the convex hull of a finite number of points, its vertices). 
Given such a polytopial decomposition, one can find a finer Voronoi diagram 
(i.e. any best-response area is an union of Voronoi cells) and finally use a 
calibrated strategy to ensure convergence with respect to this diagram. 

Although the construction of such a diagram is quite simple in M, diffi- 
culties arise in higher dimension - even in M^. More importantly, the number 
of Voronoi sites can depend not only on the number of defining hyperplanes 
but also on the angles between them (thus being arbitrarily large even with a 
few hyperplanes). On the other hand, the description of a Laguerre diagram 
- this concept generalizes Voronoi diagrams - that refines a polytopial de- 
composition is quite simple and is described in Proposition 11.111 below. For 
this reason, we will consider from now on this kind of diagram (sometimes 
also called Power diagram) . 

Given a subset of Laguerre sites {z{l) G M'^; / G £} and weights {uj{l) G 
M; I G £}, the /-th Laguerre cell P{1) is defined by: 

P{1) = |z G R'^; \\Z - z{l)f - uj{l) < \\Z - z{k)f - uj{k), VA; G £} , 

where || • || is the Euclidian norm of M'^. Each P{1) is a polyhedron and 
-p = {P{1); / G £} is a covering of M^. 

Definition 1.9 A covering fC = {i^*; i £ X} of a polytope K with non- 
empty interior is a polytopial complex of K if for every i,j in the finite set 
I, is a polytope with non-empty interior and the polytope n has 
empty interior. 

This definition extends naturally to a polytope K with empty interior, if we 
consider the affine subspace generated by K. 
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Lemma 1.10 There exists a subset I' C I such that {B^; i G I'} is a 
polytopial complex of A{J'), where B^ is the i-th best response area defined 
by 

B' = {yG A{J)- i G BR{y)} = BR-\i). 

Proof. For any y G A{J'), p{-,y) is linear on A(X) thus it attains its 
maximum on I and IJiex-^* ~ ^(•^)- Without loss of generality, we can 
assume that each B^ is non-empty, otherwise we drop the index i. For every 
i,k £ Z, p{i,-) — p{k,-) is linear on A{J') therefore B^ is a polytope; it is 
indeed defined by 

B' = {yeA{jy,p{i,y)>p{k,y),ykeI} 

k£l 

so it is the intersection of a finite number of half-spaces and the polytope 
A{J). 

Moreover if B^^, the interior of B^ n -B^, is non-empty then p{i, •) equals 
p{k,-) on the subspace generated by Bq^ and therefore on A{J'); conse- 
quently B^ = B^. Denote by I' any subset of X such that for every f G X, 
there exists exactly one i' G X' such that B^ = B^ ^ 0, then {B^; i G X'} is 
a polytopial complex of A{J). □ 

Proposition 1.11 Let fC = {K^; i £ 1} be a polytopial complex of a 
polytope K C M^. Then there exists {z{l) G M^, oj{l) G M; Z G £}, a 
finite set of Laguerre sites and weights, such that the Laguerre diagram 
V = {P{1)', I G £} refines /C, i.e. every is a finite union of cells. 

Proof. Let IC = {K'; z G X} be a polytopial complex of K C R'^. Each 
is a polytope, thus defined by a finite number of hyperplanes. Denote by 
v. = {Ht; t G T} the set of all defining hyperplanes (the finite cardinality 
of T is denoted by T) and IC = {K'", I G £} the finest decomposition of 
induced hy H - usually called arrangement of hyperplanes - which by 
definition refines IC. Theorem 3 and Corollary 1 of Aurenhammer [2] imply 
that IC is the Laguerre diagram associated to some {z{l), uj{l); I G C} whose 
exact computation requires the following notation: 

i) for every t G T, let ct G R'^ and bt e R (which can, without loss of 
generality, be assumed to be non zero) such that 

Ht = [x€R''; {X,ct)=bt]. 
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ii) For every / G L and t ^T, crt{l) = 1 if the origin of M*^ and are in 
the same halfspace defined by Ht and (Jt{l) = — 1 otherwise. 

iii) For every I G C, we define : 

4,) = ^'.rM'y. .(0 = ||.(0f + 2 ^'^C""" - (1) 



Note that one can add the same constant to every weight uj{l). □ 

Buck j8j proved that the number of cehs defined by T hyperplanes in 
is bounded by Ylk=o (T) ~- 'Pi'^, d), where (^) is the binomial coefficient, T 
choose k. Moreover, T is smaller than I{I — l)/2 (in the case where each A"' 

has a non-empty intersection with every other polytope), so L < (p (^^,dj . 

If d > n, then (j){n,d) = 2^. Pascal's rule and a simple induction imply 
that, for every 7i,d E N, (j){n,d) < (n + 1)'^. Finally, for any n > 2d, by 
noticing that 

G) + (/i) + --- + (o) ^y^/ d r^v^ d 



f") -^\n-d^\ -^\n-d+\ 

\dJ m=0 ^ ^ m=0 ^ 

which equals ^l2d+i - ^+d, we deduce that (p{n,d) < {l+d)Q < (l+d)^. 

Lemma 1.12 Let V = {P{1); Z G £} be a Laguerre diagram associated to 
the set of sites and weights {z{l) E W^, lo{1) E M; / E £}. Then, there exists 
a positive constant Mp > such that for every Z E M'^ if 

\\Z - z{l)f -uj{l) <\\Z - z{k)f -Lo{k) + e, yi,keC (5) 

then d{Z,P{l)) is smaller than Mpe. 

The proof can be found in Appendix lA.lt the constant Mp depends on the 
Laguerre diagram, and more precisely on the inner products {ct,Ct'), for 
every t,t' £ T. 

1.4 Optimal algorithm with full monitoring 

We reformulate Proposition 11.51 and Corollary 11.61 in terms of Laguerre dia- 
gram. 
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Theorem 1.13 For any set of sites and weights {y{l) G K"^, a;(/) G M; Z G 
£} there exists a strategy a of the predictor such that for every strategy t of 
Nature: 



U, 



Ik 



uj,n 



\jn - y 



< o 



1 







where Uuj^n is defined by : 

- y(k)f - ooik)] \il = ln 

otherwise 



Corollary 1.14 For any set of sites and weights {y{l) G M"^, a;(Z) G M; Z G 
£}, there exists a strategy a of the predictor such that, for every strategy r 
of Nature, with P(j,r probability at least 1 — (5, and 1,1 G C: 



\Nn{l)\ 



n 



l|jn(0-y(Of -^(0 - \\Ui)-yik)f-^ik) 



2M„ ^ 



where M„ 



sup \ E^,T 



If/,. 



< 4Vl||(6,c) 



n 



L2\ 2K, 



6 J 3 n 



sup sup Eo-,T 
m<n l,k(iC 



u: 



Ik 



sup sup 

m<n l,k£C 

sup ||cf II + sup |6t|. 



jjik 



jjlk 



<4| 



Kr. 



n 



L2 



<4||(6,c)|| 



2 . 

oo ' 



ll(^',c)|U 

Such a strategy is said to be calibrated with respect to {y{l), <^(0i ^ ^ 



The proof are identical to the one of Proposition 11.51 and Corollary ll.6i We 
have now the material to construct our new tool algorithm: 

Theorem 1.15 There exists an internally consistent strategy cr of the fore- 
caster such that for every strategy r of Nature and every n G N, with Pq-.t 
probability greater than 1 — 6: 



maxR'^ < O 

i,k(iX 




(6) 
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Proof. The existence of a Laguerre Diagram {Y{1); I G £} associated to a 
finite set {y{l) G M"', u}{l) G M; / G £} that refines {S*; i G X} is imphed 
by Lemma 1 1.101 and Proposition II .111 So, for every I G i2, there exists 
such that C -B*^'). As in the naive algorithm, the strategy a of the 

decision maker is constructed through a strategy a cahbrated with respect 
to {y{l), / G £}. Whenever, accordingly to a, the decision maker (seen 

as a predictor) should play / in Fc, then he (seen as a forecaster) plays 
in r. 

If we denote by jn(0 the projection of jn(0 onto Y{1) then: 



< 



Z:i(0=' 

E 

l:i(l)=i 
+ 



l^^n,(OI 



n 



P{k,jn{l)) - p{k,Jn{l)) 



< 



p{i{l)Jn{l)) - P{i{l)jn{l)) 

y: ^f2M,ib~(o-i„ 



l:i(l)=i 



< (2MoMpL) max 



J„(/)-y(Or-a;(/) 



||j„(0-2/(A:)r-^(A:) 



where the second inequality is due to the fact that G BR{Jn{l)) and 
the third to the fact that p is Mp-Lipschitz. The fourth inequality is a 
consequence of Lemma 11.121 

Corollarv 1 1 . 141 vields that for every strategy r of Nature, with Pq-^t prob- 
ability at least 1 — 5: 

Nn{l) 



max ■ 

Lk n 



in{l) - y 



8\/L||(6, c) 



mb,c)\[ ' 



< 



+ 



n 



'2 In 



n 



L2 
5 

1/211 



therefore with VLq = lQMpMpL^/'^\\{h,c)\\oo and VLi = pM p ' -\\{b,c} \\oo 
one has that for every strategy of Nature and with probability at least 1 — 5: 



max R, 



•ik 



max ■ 

i,kGX 



n 



p{kjn{i)) - p{ijn{i)) ) <^ + ^j2\n(^^y 



n 
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□ 



Remark 1.16 Theorem \1.15\ is already well-known. The construction of 
this internally consistent strategy relies on Theorem \ 1.1 31 which is implied 
by the existence of internally consistent strategies... Moreover, as mentioned 
before, it is far from being efficient since L - that enters both in the compu- 
tational complexity and in the constant - is polynomial in . There exist 
efficient algorithms, see e.g. Foster & Vohra 113^ or Blum & Mansour 

However, the calibration is defined in the space of Nature 's action, where 
real payoffs are irrelevant; they are only used to decide which action is as- 
sociated to each prediction. Therefore the algorithm does not require that 
the forecaster observes his real payoffs, as long as he knows what is the best 
response to his information (Nature's action in this case). This is precisely 
why our algorithm can be generalized to the partial monitoring framework. 

The polytopial decomposition of A(J') induced by {bt, ct] t G T} is 
exactly the same as the one induced by {'yb{t), 7c(t); t € T} for any 7 > 0. 
Thus, by choosing 7 small enough, \\{b, c)||oo — and therefore the constants 
in Corollary 11.141 — can be arbitrarily small (i.e. multiplied by any 7 > 0). 

However, these two Laguerre diagrams are associated to the sets of sites 
and weights C{1) and £(7), where £(7) = {'yz{l), ^ijj{1) + 7^||2^(0lP ~ 
7||z(/)||; / G £}. If £(7) is used instead of C{1), then the constant Mp 
defined in Lemma 11.121 should be divided by 7. So, as expected, the con- 
stants in the proof of Theorem 1 1 . 1 5 1 do not depend on 7. From now on, we 
will assume that c)||oo is smaller than 1. 

2 Partial monitoring 

2.1 Definitions 

In the partial monitoring framework, the decision maker does not observe 
Nature's actions. There is a finite set of signals S (of cardinality S) such 
that, at stage n the forecaster receives only a random signal Sn ^ S. Its law 
is s{in, jn) where s is a mapping from I x J' to A(5), known by the decision 
maker. 



We define s from A{J) to A(5)^ by s{y) = [Ey [s{i,j)] ) G A(5)-^. 



Any element of A{S)^ is called a flag (it is a vector of probability distribu- 
tions over S) and we will denote by J- the range of s. Given a flag / in J^, 
the decision maker cannot distinguish between any different mixed actions 
y and y' in A(^) that generate /, i.e. such that s{y) = s{y') = f. Thus s 
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is the maximal informative mapping about Nature's action. We denote by 
fn = s(jn) the (unobserved) flag of stage n G M. 

Example 2.1 Label efficient prediction (Example 6.8 in Cesa-Bianchi & 
Lugosi ISi): 

Consider the following game. Nature chooses an outcome G or B and the 
forecaster can either observe the actual outcome (action a) or choose to not 
observe it and pick a label g or b. His payoff is equal to 1 if he chooses the 
right label and otherwise is equal to 0. Payoffs and laws of signals are defined 
by the following matrices (where a, b and c are three different probabilities 
over a finite given set S ). 





G 


B 




G 


B 














a 


b 


Payoffs: g 





1 


and signals: g 


c 


c 


b 


1 





b 


c 


c 



Action G, whose best response is g, generates the flag (a, c, c) and action 
B, whose best response is b, generates the flag {b,c,c). In order to distin- 
guish between those two actions, the forecaster needs to know s{o,y) although 
action o is never a best response (but is purely informative). 

The worst payoff compatible with x and / G is defined by: 

W{xJ)= inf p{x,y), (7) 

yGs-i(/) 

and W is extended to A(5)^ by W{x,f) = W {x,Ylr{f)). 

As in the full monitoring case, we define, for every e > 0, the e-best 
response multivalued mapping BR,, : A(5)^ ^ ^{'^) by : 

= L G A(X); W{xJ)> sup W{zJ)-e\. 

y zeA{i) J 

Given a flag / G A(5)^, the function W{-, f) may not be linear so the best 
response of the forecaster might not contain any element of X. 

Example 2.2 Matching Penny in the dark: 

Consider the Matching Penny game where the forecaster does not observe 
the coin but always receives the same signal c: every choice of Nature gener- 
ates the same flag (c, c). For every x G [0, 1] = A({i?, T}) - the probability 
of playing T -, the worst compatible payoff W{x, (c, c)) = minyg/^(j) p{x,y) 



18 



is equal to — 11 — 2x\ thus is non-negative only for x = 1/2. Therefore the 
only best response of the forecaster is to play + while actions H and 
T give the worst payoff of -1. 

The definition of external consistency and especially equation ([T]) extend 
naturally to this framework: a strategy of the decision maker is externally 
consistent if he could not have improved his payoff by knowing, before the 
beginning of the game, the average flag: 

Definition 2.3 (Rustichini |27| ) A strategy a of the forecaster is exter- 
nally consistent if for every strategy r of Nature: 

limsup max W{z, fn) — Pnl^^-, ^a,T-o.s. 
n-s>+oo 2eA(X) 

The main issue is the definition of internally consistency. In the full 
monitoring case, the forecaster has no internal regret if, for every i £ I, the 
action i is a best-response to the empirical distribution of Nature's actions, 
on the set of stages where i was actually chosen. In the partial monitoring 
framework, the decision maker's action should be a best response to the 
average fiag. Since it might not belong to I but rather to A(X), we will 
(following Lehrer & Solan j2T]) distinguish the stages not as a function of 
the action actually chosen, but as a function of its law. 

We make an extra assumption on the characterization of the forecaster's 
strategy: it can be generated by a finite family of mixed actions {x{l) G 
A(X); / G £} such that, at stage n G N, the forecaster chooses a type In 
and, given that type, the law of his action z„ is G A(X). 

Denote by Nn{l) = {m G {l,...,n}; l^ = 1} the set of stages before 
the n-th whose type is I. Roughly speaking, a strategy will be e-internally 
consistent (with respect to the set C) if, for every / G C, x{l) is an e-best 
response to fn{l), the average fiag on Nn{l) (or the frequency of the type I, 
\Nn{l)\/n, converges to zero). 

The finiteness of C is required to get rid of strategies that trivially insure 
that every frequency converges to zero (for instance by choosing only once 
every mixed action). The choice of {x{l); I G C} and the description of the 
strategies are justified more precisely below by Remark 12.71 in section [231 

Definition 2.4 ( Lehrer & Solan [21j ) For every n G N and every I G 
C, the average internal regret of type I at stage n is 

lZn{l)= sup [W{xJn{l))-Pn{l)\. 
a:6A(J) 
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A strategy a of the forecaster is {C,e) -internally consistent if for every 
strategy r of Nature: 



In words, a strategy is {C, e)-internally consistent if, for every / G C, the 
forecaster could not have had, for sure, a better payoff (of at least e) if he 
had known, before the beginning of the game, the average flag on Nn{l) (or 
the frequency of / is small). 

2.2 A naive algorithm 

Theorem 2.5 ( Lehrer & Solan |21] ) For every e > 0, there exist {C,e)- 
internally consistent strategies. 

Lehrer &: Solan [21] proved the existence and constructed such strategies 
and an alternative, yet close, algorithm has been provided by Perchet |24j . 
The main ideas behind them are similar to the full monitoring case so we 
will quickly describe them. For simplicity, we assume in the following sketch 
of the proof, that the decision maker fully observes the sequence of flags 



Recall that W is continuous (see Lugosi, Mannor & Stoltz [23], Propo- 
sition A.l), so for every e > there exist two finite families Q = {f{l) S 
A(5)^; / G £}, a (5-grid of A(5)^ and X = {x{l) G A(/); / G £} such that 
if / is (5-close to /(/) and x is (5-close to x{l) then x belongs to BRf, (/). A 
calibrated algorithm ensures that: 

i) fn{l) is asymptotically (5-close to f{l) - because it is closer to f{l) than 
to every other f{k); 

ii) in{l) converges to x{l) as soon as |A^n(OI is big enough - because on 
Nn{l) the choices of action of the decision maker are independent and 
identically distributed accordingly to x{l); 



iii) Pn{l) converges to p{x{l),jn{l)) which is greater than W (x{l), fn{l) 



Therefore, W (x{l), fn{l)) is close to W (x{l), f{l)] which is greater than 



W iz, f{l)] for any z G A(X). As a consequence Pn{l) is asymptotically 




/„ = s(j„) G A{sy. 



because jn(0 generates the flag /n(0- 
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greater (up to some e > 0) than sup ^ W yz, fn{l)j , as long as \Nn{l)\ is big 
enough. 

The difference between the two algorithm lies in the construction of a 
calibrated strategy. On one hand, the algorithm of Lehrer & Solan [21j 
reduces to Blackwell's approachability of some convex set C C E^^-^; it 
therefore requires to solve at each stage a linear program of size polynomial 
in e*^^, after a projection on C. On the other hand, the algorithm of Perchet 
[21] is based on the construction given in section [1. 2. H it solves at each stage 
a system of linear equation of size also polynomial in e^^ . 

The conclusions of the full monitoring case also apply here: these highly 
non-efficient algorithms cannot be used directly to construct (£, 0)-internally 
consistent strategy with optimal rates since the constants depend drasti- 
cally on £ . We will rather prove that one can define wisely once for all 
{/(/), I G £} and {x{l); I £ C} (see Proposition 12.61 and Proposition 

II. lip so that x{l) E A(X) is a 0-best response to any fiag / in P{1), the 
Laguerre cell associated to f{l) and uj{l). 

The strategy associated with these choices will be {C, 0)-internally con- 
sistent, with an optimal rate of convergence and a computational complexity 
polynomial in L. 

2.3 Optimal algorithms 

As in the full monitoring framework (cf Lemma ll.lOp . we define for every 
X G A(X) the x-best response area as the set of flags to which x is a best 
response : 

B== = {f £ A{SY; X G BR{f)] = BR-\x). 

Since W is continuous, the family {B^; x G is a covering of A{S)^ . 

However, one of its finite subsets can be decomposed into a finite polytopial 
complex: 

Proposition 2.6 There exists a finite family X = {x[l) G A(X); I G £} 
such that the family jS^'^'^; Z G £| of associated best response area can he 
further subdivided into a polytopial complex of A{S)^ . 

The rather technical proof can be found in Appendix IA.2[ In this framework 
and because of the lack of linearity of W, any iJ^'W might not be convex nor 
connected. However, each one of them is a finite union of polytopes and the 
family of all those polytopes is a complex of A(5)^. 
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Remark 2.7 As a consequence of Proposition [27SI there exists a finite set 
X C A(X) that contains a best response to any flag f . In particular, if the 
decision maker could observe the flag fn before choosing his action x„ then, 
at every stage, x„ would be in X. So in the description of the strategies 
of the forecaster, the finite set {x{l); I £ C} = X is in fact intrinsic i.e. 
determined by the description of the payoff and signal functions. 

As a consequence of this remark, mentioning C is irrelevant; so we will, 
from now on, simply speak of internally consistent strategies. 

2.3.1 Outcome dependent signals 

In this section, we assume that the laws of the signal received by the decision 
maker are independent of his action. Formally, for every i,i' € I, the two 
mappings s(z, •) and s{i' , •) are equal. Therefore, T (the set of realizable 
flags) can be seen as a polytopial subset of A(5). Proposition 12.61 holds in 
this framework, hence there exists a finite family {x{l); I € £} such that 
for any flag f £ J-', there is some / G C such that x{l) is a best-reply to /. 
Moreover, for a fixed I G C, the set of such flags is a polytope. 

Theorem 2.8 There exists an internally consistent strategy a such that for 
every strategy r of Nature, with ^^^-y -probability at least 1 — 6: 



Proof. Propositions 11.111 and 12.61 imply the existence of two finite families 
{x{l); I G £} and {/(/), oj{l); I G £} such that x{l) is a best response to 
any / in P{1)., the Laguerre cell associated to /(/) and uj{l). Assume, for the 
moment, that for any two different / and k in £, the probability measures 
x{l) and x{k) are different. 

The strategy a is defined as follows. Compute a strategy a calibrated 
with respect to {/(Oi '^(0; ^ ^ When the decision maker (seen as a pre- 
dictor) should choose I G C accordingly to a, then he (seen as a forecaster) 
plays accordingly to x{l) in the original game. Corollarv 11.141 (with the as- 
sumption that 11(6, c) I loo is smaller than 1) implies that with Pq-.t probability 
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at least 1 — 5i: 

\Nn{l)\ 

max 

ze£ n 



Sn(0-/(Of -^(0 - \\Sn{l)-f{k)f-^{k) 



< 



8y/L 4 / /L2\ 



therefore combined with Lemma ll.l2l this yields that : 

\Nn{l)\ 



max ■ 



s„(0 - f,{l) 



(9) 



l&C n 

where /n(0 is the projection of s„(/) onto P{1) 



Hoeffding-Azuma's inequality implies that with P(j,r probability at least 
1-^2: 



max 



l^^n(OI 



Sn{l) - Ul) 



lec n 

and with probability at least 1 — : 



< 



n 



(10) 



\Nn{l)\ 



max ■ 



- p{x{l),Jn{l)) 



< M, 



21nlf 



n 



(11) 



is Mvy-Lipschitz in / (see Lugosi, Mannor & Stoltz [23]) and s (jn(/)) 
/„(/) therefore: 



Pn(/) > W[xil),fnil)) - -pn{l) - P{x{l)jn{l)) 

and max^gA(j) 1^ (x, fn{l)) is smaller than 



Ui)-Ui) (12) 



max x,/„(0 s„(0 - /„(/) + s„(0 - /„(0 

xeA(X) V / V 

= W{x{l), fn{l)) + Mw ( Sn{l) - fn{l) + Sn{l) " Jn{l) 



(13) 



since x{l) is a best response to /«(?)• Equations (fT2|) and (fT3|) yield 



nn{l)<2Mw Sn{l)-fn{l) 



+ 2Mw 



Sn{l)-fn{l) + Pn{l)-p{x{l),Ul)) 

(14) 



23 



Combining equations (jlOp . (jlip and (|14p gives that with probabiUty at 
least 1 - (5, if we define Qq = WMpMwVL, = {2Mw + SMwMp + Mp) 
and n2 = L{L + 2S + 2): 



sup 7^„(/)<^ + ^W21n — ^ (15) 



If there exist / and k such that x{l) = x{k), then although the decision 
maker made two different predictions /(/) or f{k), he played accordingly to 
the same probability x{l) = x{k). Define Nn{l, k) as the set of stages where 
the decision maker predicts either f{l) or f{k) up to stage n, fni}-, k) as the 
average flag on this set, pn{l,k) as the average payoff and Tln{l,k) as the 
regret. Since W{x, •) is convex for every x £ A(X), then max^-g^j-j) W{x, •) 

is also convex so ^^"^^'^"^^ ^^^xeA{x) W{x, fn{U k)) is smaller than 
max ^(x, /„(/)) + max W{x,Uk)) 



n x&A{X) n x£A(X) 

\Nn{l,k)\ _ \Nn{l)\ _ \Nnik)\ _ 

and p^[l^k) = pnil) Pn[k) 

n n n 



so we still have 



n 




Hence the previous bound holds up to a factor L. □ 

Remark 2.9 Lugosi, Mannor & Stoltz [2^ have constructed an externally 
consistent strategy, i.e. such that, asymptotically, for any strategy r of Na- 
ture: 

Pn> max WizJn), P<j,r-as. 

The final argument in the proof of Theorem \2.8\ also implies that an inter- 
nally consistent strategy is also externally consistent, hence we can compare 
bounds between our algorithm. 

If the signals are deterministic, Lugosi, Mannor & Stoltz 's efficient 
algorithm has an expected regret smaller than O (n~^/^) . However this bound 
became, with random signals, O (n""^/"*) . Thus our algorithm, along with 
computing no internal regret, has a better rate of convergence - the optimal 
one. Concerning the computational complexity, the true purpose of this al- 
gorithm being the minimization of internal regret, it is not efficient to bound 
external regret. 
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2.3.2 Act ion- Out come dependant signals 

In this section, we consider the most general framework and we assume that 
the laws of the signals might depend on the decision maker's actions. Our 
main result is the following: 

Theorem 2.10 There exists an internally consistent strategy a such that, 
for every strategy r of Nature, with PCT,r probability at least 1 — 6: 

^|ax^^7^„(0 < O { -^^/in ( t ) + zIt? In ( ^ ) ) . (16) 

Proof. The proof is essentially the same as the one of Theorem \2.ii\ so we 
can assume that x(l) ^ x{k) for any two different / and k in C. The only 
difference is due to the fact that at stage n G N, the unobserved flag has 
to be estimated (see e.g. Lugosi, Mannor & Stoltz [23]). 

Following Auer, Cesa-Bianchi, Freund & Schapire [1], we define for every 
I € C and n G N, the 7„-perturbation of x{l) by x(l, n) = (1 — 7„)a;(Z) + 7„tt 
where u is the uniform probability over X and (7n)neN is a non-negative 
non-increasing sequence. For every n S N, let 




xifn 1 '^) \in 

where x{ln, n)[in] > 7n = In/ 1 > is the weight put by x{ln,n) on i„. With 
this notation, is an unbiased estimator of fn since K^j^t [cnl/i""^] = /n, 

seen as an element of (IR'^)^. 

We define now the strategy of the forecaster. Assume that in an auxiliary 
game Fc, a predictor computes a, a calibrated strategy with respect to 
{f{l), I G C}, but where the state at stage n is the estimator e„, G . 

When the decision maker (seen as a predictor) should choose accordingly 
to cj in Fc, then he (seen as a forecaster) chooses in accordingly to x{ln) in 
the original game. 

In order to use Corollarv 11.141 we need to bound Vn, Mn and Kn- In 
the current framework and thanks to Proposition 11.111 one has for every 
Lk € C and n G M: 
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so using the fact that ||(&, c)||^ = 1 and the definition of e„: 

2 



sup sup Ecr,- 
l.keC m<n 



u: 



Lk 



< 16E^ . 



Ex(lr,,n)\i] I 



In 



As a consequence, Kn < 4-^, Vn < 4,/-^ and M„ < 4,/^. Lemma 11.121 

/n V in v /n 



imphes that, with Pa,T probabihty at least (1 — for every Z G £: 



\Nn{l)\ 



n 



en{l) - fn{l) 



< 



/21n 



L2 \ 8 Mp , 
+ - — - In 



5i y 3 7„n 



L2 

<5i 



where /n(0 is the projection of e„(/) onto P{1). 

Fohowing Lugosi, Mannor & Stoltz [23], since for every i £ I and s G 5, 

Eo-,T |en**P < l/7n5 Freedman's inequahty imphes that with probabihty at 

least 1 — 52, for every / G C 



n 



en{l) - fn{l) 



< VIS 



/2LIS\ 



+ 



f 2LIS \\ 



In 



Hoeffding-Azuma's inequality imphes that with probability at least 1 — 6^: 



Nn{l) 



max ■ 

l£C n 



Pnil) - p{x{l),Jn{l)) 



< Mr, 



n 



and by taking 7„ = n ^Z^, one has I]me7Vn(/)^™ - t'^^^^- 
quence, for every I £ C, with probability at least 1 — 6: 



a conse- 



n 



n 



1/3 „l/3 



'21n 



n 



1/2 ■ 



'21n 



2^5 \ 2 ^4 



2J],r 



with the constants defined by = IGMpMyyv LI + SM^MpI, = 
2Mh/\/7 (sMp + , 1^3 = Mp, = 2Mvy(4Mp + /FS) and J^g = 
L{L + 2 + 2IS). They can be decreased if concentration inequalities in 
Hilbert spaces are used (see section I3.3P . □ 

In the label efficient prediction game defined in Example 12.11 for every 
strategy a of the decision maker there exists a sequence of outcomes such 
that the forecaster expected regret is greater than (see Theorem 5.1 

in Cesa-Bianchi, Lugosi & Stoltz |10|). Therefore the rate of n^^/^ of our 
algorithm is optimal for both internal and external regret. 
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The computational complexity of this internally consistent algorithm is 
polynomial in L. Thus it can be seen, in some sense, as an efficient one. A 
question left open is the existence of an algorithm whose computational com- 
plexity is polynomial in the minimal number of best-response areas required 
to cover A(5)^, see Proposition 12.61 

The following section 13.11 deals with a simpler question and exhibits an 
internally consistent algorithm which requires to solve at each stage a linear 
program of size polynomial in Lq, the minimal number of polytopes on which 
BR is constant, instead of a system of linear equations of size L. 

3 Concluding remarks 

3.1 Second algorithm: calibration and polytopial complex. 

The algorithms we described are quite easy to run stage by stage since the 
forecaster only needs to compute some invariant measures of non-negative 
matrices. However, they require to construct the Laguerre diagram V = 
{P{1)', I G C} given the set {6^, q; t G T}. And we have shown that L, 
which is a factor both in the complexity of the algorithms and in their rate 
of convergence, can be in the order of T^^ hence polynomial in L^^ . 

This section is devoted to a modification of the algorithm that does not 
require to compute a Laguerre diagram but which is more difficult, stage by 
stage, to implement. The only difference between the two algorithms is in 
the definition of calibration. 

Let {K{1); I G Cq} be a finite polytopial complex of A(J'). It is defined 
by two finite families {q G R-^, fet G R; t G T} and {?"(/) C T; / G £} such 
that: 



Let us define {ct^i,bt^i) = {ct,bt) if t G T{1) and {ct^uht^i) = (0,0) otherwise. 
Then we can rewrite K{1) = {y G A(J'); (y, Ct^i) < 6t,/, Vt G T}. 

Definition 3.1 ^ strategy a is calibrated w.r.t. the complex {K{1); I G Co} 
if for every strategy r of Nature, Wa^r-os: 



Theorem 3.2 There exist calibrated strategies w.r.t. any finite polytopial 
complex {K{1); I G Cq}. 



K{1) = {y G A( J); (y, q) < bt, Vt G Til) C T} , W G £0- 



lim sup 




(jn(0, ct,/> - Ki < 0, yt£Tyie Co. 
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Proof. Consider the following auxiliary two-person game F^, where at stage 
n G M the predictor (resp. Nature) chooses /„, € Co (resp. j„ G J) which 
generates the vector payoff Un S IR^^° defined by: 



Any strategy that approaches the negative orthant in is calibrated 
w.r.t. the complex {K{1)\ I G Cq}. 

Blackwell's characterization of approachable convex sets (see Blackwell 
[5], Theorem 3) implies that the predictor can approach the convex set Q.- 
if (and only if) for every mixed action of Nature in A.{J'), he has an action 
X G A(£o) such that the expected payoff is in Given y.„ G ^{J)-, 

choosing l{yn) G -Co, where l{yn) is the index of the polytope that contains 
Un, ensures that Ey^^;(y^-)[f7„] is in r2_. Therefore there exist calibrated 
strategies with respect to any polytopial complex. □ 

This modification of the definition of calibration does not change the 
other part of our algorithms nor the remaining of the proofs (in particular, 
to calibrate the sequence of unobserved flags, the forecaster must use 7„- 
perturbations). The constants in the rates of convergence are now smaller 
since Lq can be much smaller than L and in F^, E[||C/„|p] is bounded by 

O (^^^ where Tq = sup^g^^^ is the maximum number of hyperplanes 
defining a polytope of the complex. 

The main argument behind this algorithm (i.e. the characterization of 
approachable convex sets of Blackwell [5]) is quite close, in spirit, to the 
one of Lehrer & Solan [21] . Note that however, with our representation, the 
projection on J7_ can be computed linearly in TLq, so polynomially in Lq. 
Therefore, it reduces to the construction of an approachability strategy and 
so - as shown by Blackwell [5] - to the resolution, at each stage, of a linear 
programming of size polynomial in Lq. 

3.2 Extension to the compact case 

We prove in this section that the finiteness of J is not required. 

Assume that instead of choosing j„ at stage n G M - which generates the 
flag /„, = s{in) and an outcome vector ( p{i^ jn) ] - Nature chooses directly 

an outcome vector 0„ G [—1, 1]^ and a flag /„ which belongs to s(0„) where 
s is a multivalued mapping from [—1, 1]^ into A(5)^. As before, the decision 
maker's payoff is O^" (the in-th coordinate of 0„) and he receives a signal 
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Sn whose law is . Strategies of the forecaster and consistency are defined 
as before. 

Theorem 3.3 If the graph of s is a polytope, then there exists an internally 
consistent strategy a such that, for every strategy r of Nature, with Po-,t 
probability at least 1 — 6: 

_m!^«,.,0<o(-J^^+-i3l„(i))^ (17) 

The proof of this result is identical to the one of Theorem 12.101 

Note that the assumption that the graph of s is a polytope is fulfilled 
in the finite dimension case. The mapping s is multivalued since in finite 
dimension there might exist two different mixed actions yi,yi in A(^7) that 
generate the same outcome vectore (i.e. p{-,yi) = p{-,y2) = O) but different 
flags (i.e. /i = s(yi) / 3(1/2) = /2)- Hence we should have /i,/2 G s(0). 

3.3 Strengthening of the constants 

We propose two different ideas to strengthen the constants of our algorithm. 
First, we can use (as did Lugosi, Mannor &: Stoltz |23j ) only one concen- 
tration inequality for every coordinate of the vector U^^n instead of one 
concentration inequality per coordinate. Second, we can implement sparser 
vector payoffs (so that its norm decreases) by looking at a slight different 
definition of calibration. 

3.3.1 Concentration Inequalities in Hilbert Spaces 

The rates of convergence of our algorithms rely mainly on three properties: 
Blackwell's approachability theorem, Hoeffding-Azuma's and Freedman's in- 
equalities. These tools allowed us to study the convergence of a sequence 
of vectors towards 0. Approachability is well defined for sequences of 
vectors, however the two concentration inequalities hold only for real valued 
martingales. To circumvent this issue, we used in the proofs the fact that 
if a process {f/„ G IR'^j^gj^ is a martingale then, for each coordinate, the 
process {U^ G IRj^^j^ is a real valued martingale. This does not use the 
fact that Un might be sparse and the use of concentration inequalities in 
Hilbert space can sharpen the constant. 

Indeed, recall Hoeffding-Azuma's inequality: 
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Lemma 3.4 (Hoeffding |19| . Azuma [3J) Let Un be a sequence of mar- 
tingale differences bounded by K, i.e. for every n G M, Eg-.r [f^n+il^n] = 
and \Un\ < K. 

Then for every n G M and every e > 0: 



P.,.(|C^„|>e)<2exp(^) 



which can be expressed as 



^a,ri\Un\<Kj-ln(-\ \ >l-6. (18) 



n \5 

Chen & White [11] proved an equivalent property for vector martingale in 

Lemma 3.5 (Chen & White jllj ) Let Un be a sequence of martingale 
differences in IR*^ bounded almost-surely by K > 0. Then for every n G W 
and for every e > 0: 



riE^ I / — 77,£:^ \ / ns^ \ 

Pa,r {\\Un\\ >e)< 2max<( 1,^/^ Uxp [^-^ J < 2exp (^-"^ j , 

for every a < 1 — ^ (which equals approximatively 0.81 J. 

Assume that for every n G M, ||?7„||oo < l|f^l|oo and ||t/n.||2 < ll^lb; we 
can deduce from the use of only Hoeffding-Azuma's inequality that: 

However, Chen and White's result, along with the fact that \\Un\\ < L, 
implies that: 



TP f 

r max 



TTl,k 



> e < 2 exp 



4||f/|li 



which can reduce the dependency in L. The effects is even more dramatic 
when estimating the sequences of flags, since e„ has only positive component 
(so ||e„||oo = ||e„||2)- 



There also exist variants of Bernstein's inequality (see e.g. Yurinskii [3T 
in Hilbert spaces that can be used in order to get more precise constants. 
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3.3.2 Calibration with Respect of Neighborhoods 

Definition 3.6 Given a finite set y = {y{l) £ R'^, uj{l) G R; / G £}, y{k) 
is a neighbor of y{l) if k ^ I and the dimension of P{1) n P{k) is equal to 
d-l. 

We defined a calibrated strategy with respect to 3^, as a strategy a such 
that Jn(0 is asymptotically closer to y{l) than to any other y{k) as soon as 
the frequency of / does not go to zero. In fact, jn(0 needs only to be closer to 
y{l) than to any of its neighbors. So one can construct neigiibors-calibrated 
strategies by modifying the algorithm given in Proposition II. 5t the payoff 
at stage n is now denoted by U'^ and is defined by: 

= / ll-^" ~ - Win - if ^ = and /c is a neighbor of / 

^ ""^ otherwise 

The strategy consisting in choosing an invariant measure of {U'^~^ is cali- 
brated and = sup„^<„ Eo-^,- [||C/m|P] equals AJ\f , where M is the maximal 
number of neighbors. This latter can be much smaller than 4, and the gain 
from this modification is limpid if we consider e-calibration. 

Indeed, in order to construct such strategies, we usually take any e- 
discretization of A(J) so that L = 0(e~^"^~^)). However, there exists a 
discretization such that N = 2^^'^^^\ which is independent of £. 

A Proofs of technical results 

This section is devoted to the proofs of previously mentioned results, i.e. 
Lemma 11.121 and Proposition 12.61 

A.l Proof of Lemma 11.121 

Let ^ G £ be fixed, we denote by C = {q G E,'^; t G ?'(/)} the finite family 
of normal vectors to [d— l)-faces of P{1) and by i3 = {6t G R; t G T{1)} the 
family of scalars such that : 

P{1) = {Z G R'^; (Z, ct) < bt, Vt G Til)] . 

Any points satisfying Equation ([5]) belongs to 

Peil) = {Z G R'^; {Z, Ct) <bt + e,ytG T{1)] . 
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For any vertex v of P{1), there exists ti, . . . ,td G T{1) such that 

d 

v=f][zeR''; {Z,ct,) = bt,} 

k=l 

and {ct^ ct^} is a basis of E,'^. If we denote by the point defined by 

d 

k=l 

then Pe{l) is included in the convex huh of every v^- 

Equation ([5]) can be rephrased as: if x belongs to Pe{l) then d{x,P{l)) 
is smaller than Mp£. Therefore it is enough to prove this property for every 
Ve since d{-,P{l)) is a convex mapping thus maximized over a polytope on 
one of its vertices. 

With these notations, for every k £ {1, . . . ,d}, {v£ — v,Ct^) = £ and there 
exists a unique decomposition — v = '}2'k=i ^kCt^, ■ Define the symmetric 
dx d Gram matrix Qi by = {ct^^cty) and a = (ai,...,ad). Then 
following classical properties hold: 

1) \\vs — f IP = a^Qia and there exist a D = diag(Xi, . . . , A^) a diagonal 
matrix with < Ai < . . . < and a d x d matrix P and such that 
p-i = pT and Qi = P^DP; 

2) Qa = e; = (e, . . . , e) therefore a = Q'[^£; 

3) 11^;, - vf = (QY^efQiiQ^^e) = e^P^D-^Pe < e^dX^\ 

Therefore, for any Z £ P^ - and in particular for any point that satisfies 
Equation ([5]) -, \\Z — Ili{Z)\\ < max„ \\vs — v\\ < e.Vd\^Xi . The result 
follows from the fact that L is finite. The constant Mp in Lemma 11.121 is 
smaller than the square root of the inverse of the smallest eigenvalue of all 
Qi times ^/d; it depends on the inner products (cf , Cf) and on the dimension 
of J". 

A. 2 Proof of proposition 12.61 

Definition A.l Let K be a polytope. A correspondence B : K ^ M!^ is 
polytopial constant, if there exists {K{1)\ I G C} a finite polytopial complex 
of K and {x{l); I E £} such that x{l) E B{f) for every f E K{1). 
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Let us now restate Proposition 12. 6t 



Proposition A. 2 BR is polytopial constant. 

This theorem is weU-known and quite useful in the full monitoring case 
(see for example the Lemke-Howson [22] algorithm). In the compact case, 
Proposition 12.61 becomes: 

Proposition A. 3 // s has a polytopial graph, then BR is polytopial con- 
stant. 

The proofs of both propositions rely on polytopial parameterized max-min 
programs defined in the next subsection. 

A. 2.1 Constant Solution of a Polytopial Parameterized Max-Min 
Program 

A Polytopial Parameterized Max-Min Program (PPMP) is defined as fol- 
lows. Let X and y be two Euclidian spaces of respective dimension di and 
^2- Consider the program (Pf) - depending on a parameter / that belongs 
to some polytope J- in R'^^ - that is defined by 

{Pf) '. max min xAy, 

X e X y 
s.t. Dx < d s.t. Ejy < ej 

where A is a di x d2 matrix, {Ef, ef, f G J-"} is a family of matrices and 
vectors (we do not specify the sizes the matrices, as long as each inequality 
makes sense) and D, d are also a fixed matrix and vector such that the 
admissible set D = {x & X; Dx < d} is a polytope. The solution set of 
(Pf) is denoted by B{f) C X and this defines a multivalued mapping B{-) 
from into X. 

Theorem A. 4 Assume that the correspondence S defined by: 

T ^ y 

■ f ^ Sf = {y€y; Efy<ef} 
has a polytopial graph S. Then B : T ^ X is polytopial constant. 

Proof. Before going into full details, we first recall the following properties: 

i) A linear program is minimized on a vertex of the polytopial feasible 
set (this is actually implied by the following point); 
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ii) Rockafella [26], Theorem 27.4, page 270: Given x G V and f £ ii y 
minimizes xAy on Sj then 

-xAeNCs.iy), 

where NCE{y) is the normal cone to the convex set E CW^ at y e E 
defined by : 

NCEiy) = {p^ IR^; {P, z - y), G i?} ; 

iii) Ziegler [32], Example 7.3, page 193: If P is a polytope then the finite 
family {NCp{v); f is a vertex of P} is a polyhedral complex of M!^ 
called a normal fan (i.e. it is a finite family of polyhedra that cover R"' 
and such that each pair has an intersection with empty interior); 

iv) Billera & Sturmfels [1], page 530: Since for every f ^ Sj = n^^(/) 
where n:SC-Fx3^— ^J^is the projection with respect to first 
coordinates, then there exists {K(l); I G C}, a polytopial complex of 

such that the normal fan to Sf is constant on every K{1) (this can 
alternatively be deduced from the following point); 

v) Rambau & Ziegler [25], Proposition 2.4, page 221: On each of these 
polytopes K{1), the mapping f i-^ Sf is linear. In particular, there 
exists a finite family of affine functions Y(l) from K[l) to y such that 
the vertices of 5/ are exactly {y{f); y{-) G Y{1)}. 

Points i) and ii) imply that ii Xf maximizes (Pf) - which is then mini- 
mized at some a vertex of Sf denoted by y/, because of point i) - then it can 
be assumed that —XfA is a vertex of the polytope NCsf{yf) n Va~ where 
T)a- ■= {—xA; X G "D}. Thus B{f), the solution set to (Pf) contains at 
least an element of 

X.f = |x G V]—xA vertex of Ci NCsj>{yf),yf vertex of Sf^ . 

By point iii), the normal fan and therefore Xj are constant on K{1). 
The latter can also be assumed to be finite by taking a unique representant 
X G Xj for every vertices of the intersection of the normal fan and P^-. 
Since the number of different fans is finite, for any / G the solution set 
to (Pf) contains at least an element of the finite set X = IJ^-gj-X/-. 
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Moreover, for every x G X: 
S~^(x) = < f €z J-; min Xj4w > max min x'Ay > 

= W G K(l): min x^d-u > max min x'Ay > 
= U 1 / ^ -^(0> ™™ > min x'Ay > 
= U n U mmxAy>x'Ay'(/) 

= Un U n {f^m;^Ay{f)>x'Ay'{f)}, 
ze/:x'6Xy(.)ey(/)j/{-)ey(/) 

where, respectively, the second hne is a consequence of point iv), the third 
hne of the definition of X and the fourth and fifth lines of points i) and v). 

By point v), the two mapping y(-) and y'{-) are affine on K{1), so each 
possible set 

{feK{iy, xAyif)>x'Ay'if)} 

is a polytope as the intersection of an half-space and the polytope K{1). 
Since, the intersection of a union of polytopes remains a union of polytopes, 
for every x E X, B~^{x) is a finite union of polytopes and B is polytopial 
constant. □ 

We can now prove simultaneously Propositions IA.2] and IXl3l 



A.2.2 Proof of Propositions 1X2] and IATsI 

Since s is linear, its graph, denoted by S, is a polytope. Theorem I A. 41 (with 
T> = A(X)) implies that the solution, denoted by B{f) for every / € J^, of 
the parameterized program 

max min p(x, y) 

is polytopial constant. We denote by {K{1); / G £} a corresponding poly- 
topial complex. If B is constant on K{1), then it is also constant on 
K{1) = {K{1)), which is a finite union of polytopes. □ 
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